Quantifying the Effect of Missing Values on Model Accuracy in Supervised Machine Learning Models

Gerhard Svolba
12 min read · Feb 3, 2022


Problem description

Observations with missing values cannot be used by many supervised machine learning techniques like regression methods or neural network methods. Decision trees can handle missing values in the input data because they consider them as a separate category in the data and assign them to one of the branches.

The analyst faces the challenge of deciding whether to skip a certain record for the analysis or whether to impute the missing value with a replacement value. Here, however, the analyst must decide which percentage of missing values for a given variable can still be imputed. Imputing a variable with 10% missing values is not considered a problem. But what about the situation with 30%, 50%, or more missing values?

Simulation studies have been performed to give rough guidance on how different proportions of imputed missing values affect model quality. The simulation studies also show whether it makes a difference if the missing values are random or systematic, and if just the training data or both the training and the scoring data contain missing values.

Results summary

Summary of 4 different simulation scenarios, hit rate per missing value percentage
  • Random missing values in training data only have limited effect (red line)
  • Missing values that occur also in the scoring data have a larger effect (orange vs. red, light green vs. green)
  • Systematic missing values have a much larger effect (green line)

Takeaway

Do not only discuss the “acceptable percentage of missing values” in your data.
Also discuss why the values are missing and whether this also occurs in scoring.

Main simulation scenarios

Random and systematic missing values

It is important to differentiate between random and systematic missing values. Random missing values are assumed to occur for each analysis subject with the same probability irrespective of the other variables. Systematic missing values are not assumed to occur for each analysis subject with the same probability.

For the simulations, for simplicity systematic missing values have been created in the data by defining 10 equal-sized clusters of analysis subjects. Specific to the definition of the respective simulation scenario, the variables for all observations in one or more of these clusters are set to missing.
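The difference between the two mechanisms can be sketched in a few lines of Python. This is only an illustration, not the code used in the simulations: the function names, the way observations are assigned to the 10 clusters (`i % 10`), and the random choice of which clusters are set to missing are all assumptions.

```python
import random

def insert_random_missing(values, share, rng):
    """Random MV: every value is set to None with the same probability."""
    return [None if rng.random() < share else v for v in values]

def insert_systematic_missing(values, cluster_ids, n_missing_clusters, rng,
                              n_clusters=10):
    """Systematic MV: all values in the chosen clusters are set to None."""
    missing_clusters = set(rng.sample(range(n_clusters), n_missing_clusters))
    return [None if c in missing_clusters else v
            for v, c in zip(values, cluster_ids)]

rng = random.Random(42)
values = [float(i) for i in range(1000)]
# Hypothetical assignment of observations to 10 equal-sized clusters.
cluster_ids = [i % 10 for i in range(1000)]

random_mv = insert_random_missing(values, 0.3, rng)
systematic_mv = insert_systematic_missing(values, cluster_ids, 3, rng)

print(sum(v is None for v in random_mv))      # close to 300, varies with the seed
print(sum(v is None for v in systematic_mv))  # exactly 300: 3 of 10 clusters
```

In the random case the share of missing values is only hit approximately and is spread evenly over all observations. In the systematic case whole clusters lose the variable, so the missingness carries information about the analysis subject.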

Missing values in the scoring data partition

The simulations that are performed to assess the effect of missing values in the input variables must be differentiated by whether the data partition that is used to evaluate the model quality also has missing values inserted.

The simulations have been run for the following two cases:

  1. No insertion of missing values into the scoring data.
  • This scenario mirrors the perfect world in the application of the model, which means that data quality problems only occur in model training. In the application of the models, no data quality problems occur.
  • This occurs frequently when models are built on historic data that are extracted from operational or analytical systems other than the current one, or from time periods when data quality measures were not yet in place.

  2. Insertion of missing values into the training data and the scoring data.

  • This scenario refers to situations where data quality problems like missing values occur not only in the model training phase but also during the application of the model.
  • This case reflects the situation where data quality problems could not (yet) be fixed and are present in the training and scoring data.
  • The missing values in the scoring data are replaced by the imputation logic that has been defined in the model training phase with the Impute node. The respective imputation logic is part of the SAS Enterprise Miner score code.

Simulation Procedure

Preprocessing

For the simulations, a supervised machine learning task with a binary target variable has been used. The data are taken from four real-life datasets from different industries.

In order to have a “perfect” start dataset for the simulations, the data have been preprocessed in the following way. This leads to a training dataset with no missing values.

  • If a variable has more than 5% missing values → the variable is dropped
  • Observations with missing values for the remaining variables (≤ 5% of missing values) are removed from the analysis.
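The two preprocessing rules above can be sketched as follows. This is a minimal illustration in plain Python, not the author's implementation; the function name `preprocess`, the data layout (a list of dicts with `None` for missing values), and the example variables `age` and `income` are assumptions.

```python
def preprocess(rows, max_missing_share=0.05):
    """rows: list of dicts mapping variable name -> value (None = missing)."""
    n = len(rows)
    variables = list(rows[0].keys())
    # Rule 1: drop variables with more than 5% missing values.
    kept = [v for v in variables
            if sum(r[v] is None for r in rows) / n <= max_missing_share]
    # Rule 2: remove observations that still have a missing value
    # in one of the remaining variables.
    return [{v: r[v] for v in kept} for r in rows
            if all(r[v] is not None for v in kept)]

# Hypothetical data: "age" has 2% missing values, "income" has 50%.
rows = [{"age": None if i < 2 else 30 + i % 40,
         "income": None if i % 2 == 0 else 1000.0 + i}
        for i in range(100)]

clean = preprocess(rows)
print(len(clean), sorted(clean[0].keys()))  # 98 ['age']
```

The order matters: dropping the sparse variables first keeps far more observations than removing every row with any missing value.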

Multiple modeling cycles are run to retrieve a stable model with good predictive power. The list of variables of this model is frozen for use in the simulations.

Running the simulations

The following graph shows the procedure for the simulations and their evaluation. Data are split into training and validation data. The test partition is used as “scoring data” to mirror a data set that has not been used in model training. This makes it possible to evaluate the effect of missing values in model scoring.

Different types of missing values (random, systematic) are introduced in the data (training data only, training and scoring data). The quantity of missing values is increased in the simulation runs in 10% steps (0%, 10%, 20%, …, 90%).

Tree surrogate imputation (as implemented in the IMPUTE node in SAS Enterprise Miner, find more details at the end of this article) is used to impute the missing values.

Finally, the regression model with the frozen set of variables is trained on this data and the results are evaluated.
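The resulting simulation design can be written down as a simple grid. The sketch below only enumerates the configurations described above; the dictionary keys are made-up names, and the number of repetitions per configuration is not stated in the text, so it is left out.

```python
from itertools import product

missing_types = ["random", "systematic"]
affected_partitions = ["training only", "training and scoring"]
missing_percentages = range(0, 100, 10)  # 0%, 10%, ..., 90%

# One entry per combination of type, affected partitions, and percentage.
scenarios = [
    {"type": t, "partitions": p, "pct_missing": pct}
    for t, p, pct in product(missing_types, affected_partitions,
                             missing_percentages)
]

print(len(scenarios))  # 2 x 2 x 10 = 40 configurations
```

Each configuration then runs through the same steps: insert missing values, impute, train the regression model on the frozen variable list, and evaluate on the scoring data.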

Possible bias in the models in the simulation scenarios

Note that providing a predefined set of optimal variables for the predictive models in the different simulation scenarios generates a bias toward too optimistic model performance. The model does not need to find the optimal set of predictor variables. With data that have quality problems, such as too few observations, a high number of missing values, or bias in the input data, variable selection may not arrive at the optimal set of predictors.

Data quality can impact the selection of the optimal inputs. Thus, the simulation scenarios are influenced by this a priori variable selection (and, consequently, by a priori knowledge). Nevertheless, it has been decided to provide a predefined set of variables in order to compare apples to apples and to remove possible bias in the scenario results caused by differences in the variable selection.

Simulation Results

Validation of the results for the various scenarios is based on the %Response in the 5% of the cases with the highest event probability. The percentage of correctly predicted event cases compared to all cases in this group is used to compare the model accuracy between scenarios.
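As an illustration of this measure, the following sketch computes %Response for the top 5% of cases from a list of predicted event probabilities and observed events. The function name and the toy data are assumptions, and SAS Enterprise Miner may treat ties and rounding of the group size differently.

```python
def pct_response_top(scores, events, top_share=0.05):
    """%Response: event rate (in percent) among the top-scored cases."""
    n_top = max(1, int(round(len(scores) * top_share)))
    # Rank cases by predicted event probability, highest first.
    ranked = sorted(zip(scores, events), key=lambda pair: pair[0], reverse=True)
    top_events = [event for _, event in ranked[:n_top]]
    return 100.0 * sum(top_events) / n_top

# Toy data: 100 cases, three events, two of them among the 5 highest scores.
scores = [i / 100 for i in range(100)]
events = [1 if i in (99, 98, 90) else 0 for i in range(100)]

print(pct_response_top(scores, events))  # 40.0
```

With a baseline event rate of 3% in this toy data, a %Response of 40% in the top group means the model concentrates events far above what random selection would deliver.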

Random missing values only in the training data

Box plot for %Response for different percentages of random missing values in the training data

The first simulation scenario shows the effect of random missing values in the training data. Here, however, the scoring data are not affected by missing values. As already mentioned in the introduction, this reflects the situation where data already have good quality (for example, due to a system change or previous data quality efforts); however, the historic data that need to be used for model development (model training) still contain missing values that cannot be fixed retrospectively.

It can be seen from the boxplot that there is surprisingly little decrease in model quality with an increasing percentage of missing values. This shows that truly random missing values mainly add noise to the training data. The model logic that is trained on these data still comes very close to the optimal model. If the data to which the model will be applied do not contain missing values, the prediction will still be quite accurate.

In their paper “An Overview of Machine Learning with SAS® Enterprise Miner” Patrick Hall et al. describe the application of a denoising autoencoder. They also illustrate that adding some random noise to the training data might “strengthen” the autoencoder to be better able to handle fresh and unseen data. A similar case might occur here in the case where the training data are biased but only in a random way.

Random missing values in the training and scoring data

Box plot for %Response for different percentages of random missing values in the training and scoring data

This scenario shows what happens when the scoring data, to which the model is applied in production, also contain missing values. The picture changes considerably.

While there is almost no drop in model quality at 30% missing values in the training-data-only case, there is already a substantial drop in %Response, from 19.0 to 15.6, when both training and scoring data contain missing values. In relative terms, model quality drops to roughly 82% of the level achieved without missing values.

Systematic missing values only in the training data

Box plot for %Response for different percentages of systematic missing values in the training data

The first simulation scenario in this section shows the effect of systematic missing values that occur only in the training data. As explained previously, this reflects the situation where data already have good quality (for example, due to a system change or previous data quality efforts); however, the historic data that need to be used for model development (model training) still contain missing values that cannot be fixed retrospectively.

The boxplot here shows that unlike the case of random missing values (see above), systematic missing values affect the model quality even if they only occur in the training data.

Whereas random missing values introduce fuzziness into the data, which can still be treated effectively by missing value imputation, systematic missing values introduce an effect into the data that cannot be replaced by missing value imputation because these methods rely on the fact that the values are missing at random.

Systematic missing values in the training and scoring data

The picture changes substantially when the systematic error also occurs in the scoring data. Unlike the previous case, where only the model logic was biased by the missing values, here the scoring data also contain missing values.

This boxplot shows a strong decrease in model quality with an increasing percentage of missing values. It can also be seen that with 70% and more missing values, the %Response rate even falls under the 5% (dotted) line. The 5% line represents the baseline event rate in the training data, which should be achieved by a random model. The systematic missing values in that range influence the model so strongly that it does not even predict as well as a random model.

With only 10% of missing values, the %Response rate drops to 15.6%, which means that almost 18% of the predictive power of the perfect-world model is lost.

Results discussion and multivariate quantification

The broad range of the four lines over different missing value percentages illustrates how important it is to clearly differentiate between different types of missing values.

Summary of 4 different simulation scenarios, hit rate per missing value percentage

Very often the topic of missing values is discussed in analytics as one monolithic subject. However, the numbers presented in the tables and figures here, and especially the line chart, make visually clear that it is wrong to consider all missing values as the same.

In every discussion about missing values in predictive analytics, you should bear in mind whether

  • the missing values occur only in the training data or whether they also occur in the scoring data when the model is applied
  • the missing values are random or systematic by nature

It is important to adequately address the problem of missing values because, otherwise, the influence of missing values may be over- or underestimated.

The first point may be easier to qualify than the second point. In many cases, the analyst knows in advance whether bad data quality in terms of missing values only affects the (historic) training data.

It is sometimes harder to decide whether missing values occur randomly or systematically. Here, business expertise as well as a more detailed analysis of both the profile of missing values and the correlation of the indication of missing values with other variables need to be performed to gain more insight.
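One simple way to start such an analysis is to compare the share of missing values across the levels of another variable. The sketch below uses made-up data and names (`income`, `region`); a clearly uneven missing rate between groups is a hint that the values are not missing at random, although it cannot prove the opposite.

```python
def missing_rate_by_group(rows, variable, group_variable):
    """Share of missing values of `variable` within each level of `group_variable`."""
    counts = {}
    for row in rows:
        n_missing, n_total = counts.get(row[group_variable], (0, 0))
        counts[row[group_variable]] = (n_missing + (row[variable] is None),
                                       n_total + 1)
    return {group: m / t for group, (m, t) in counts.items()}

# Hypothetical data: income is missing for 40 of 50 customers in the north
# and for none of the 50 customers in the south.
rows = ([{"region": "north", "income": None if i < 40 else 2500.0}
         for i in range(50)]
        + [{"region": "south", "income": 2700.0} for _ in range(50)])

print(missing_rate_by_group(rows, "income", "region"))
# {'north': 0.8, 'south': 0.0}
```

An overall missing rate of 40% would hide this pattern completely, which is why the profile of the missing values should be checked against other variables and discussed with business experts.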

Multivariate quantification

In order to quantify the effect of the following three potential influential factors on the model quality measured by %Response, a linear model Response = f (%missing, Systematic MV, MV in Scoring) has been run on these data:

  • the percentage of missing values
  • the binary category “random or systematic” missing values
  • the binary category “missing values only in the training data or in both the training and scoring data”

The outcome in terms of the coefficients for each of the variables is as follows:

  • Intercept: 19.29 → This represents the estimate for 0% of missing values when only the training data are biased and the missing values are systematic (the reference category of the RandomSystematic effect).
  • Percentage of missing values: -0.0996 → %Response decreases on average by 0.1 percentage points for each additional percentage point of missing values.
  • ScoringData: -4.23 → %Response decreases on average by 4.23 percentage points if both the training data and the scoring data are biased.
  • RandomSystematic=Random: 3.6 → %Response is on average 3.6 percentage points higher if the missing values occur randomly.
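To see how the coefficients combine, they can be put into a small prediction function. The coefficients are the ones listed above; the function name, the argument names, and the assumption that the model is purely additive (no interaction terms are mentioned in the text) are mine.

```python
def predicted_pct_response(pct_missing, random_mv, mv_in_scoring):
    """Estimated %Response from the linear model coefficients in the text."""
    estimate = 19.29                  # intercept: systematic MV, training data only
    estimate += -0.0996 * pct_missing  # effect per percentage point of missing values
    if mv_in_scoring:
        estimate += -4.23              # missing values also occur in the scoring data
    if random_mv:
        estimate += 3.6                # missing values occur randomly
    return estimate

# 30% random missing values in both training and scoring data
print(round(predicted_pct_response(30, random_mv=True, mv_in_scoring=True), 2))
# 15.67
```

As with any linear approximation, the estimates are averages over all scenarios and do not reproduce each simulated value exactly, for example at 0% missing values.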

Calculating a business case

Introduction

A fictional reference company is used to calculate a business case based on the outcome of the different simulation scenarios. The change in model quality is transferred into a response rate and the respective profit is expressed in US dollars.

This is done in order to illustrate the effect of different data quality changes from a business perspective. As a note of caution, the numbers in US dollars should only be considered as rough indicators based on the assumptions of the simulation scenarios and on the business case as described here. In individual cases, these values and relationships may differ.

The reference company Quality DataCom

The reference company, Quality DataCom, operates in the communications industry. The company has 2 million customers and runs campaigns to promote the cross- and up-sell of its products and services. In a typical campaign, about 100,000 customers (5% of the customer base) with the highest response probability are contacted. The average response of a customer to an offer (offer take-up or product upgrade) represents a profit of $25.

Assume that you use an analytic model that predicts 19% of the positive responses in the top 5% of the customer base correctly. Response here means that the campaign contact results in a product purchase or upgrade. This leads in total to 19,000 responding customers in this campaign, which generate a profit of $475,000 (19,000 x $25).
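The arithmetic of this business case is simple enough to express as a small function. The numbers (100,000 contacted customers, $25 per response, 19% response) come from the text; the function name and parameter names are illustrative.

```python
def campaign_profit(pct_response, contacted=100_000, profit_per_response=25):
    """Profit of one campaign for a given %Response in the contacted group."""
    responders = contacted * pct_response / 100
    return responders * profit_per_response

print(f"{campaign_profit(19.0):,.0f}")  # 475,000
# The same campaign with the 15.6 %Response quoted earlier for 30% random
# missing values in training and scoring data:
print(f"{campaign_profit(15.6):,.0f}")  # 390,000
```

Because the campaign size and the profit per response are fixed, every percentage point of %Response is worth $25,000 per campaign in this setup.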

Results

For the reference company, the analysis of the missing value scenarios reveals the following:

  • A reduction of missing values from 50% to 30% means an additional profit of $47,500 ($2,375 per percentage point).
  • A reduction of missing values from 30% to 10% provides an additional profit of $62,500. This represents $3,125 per percentage point.
  • These numbers show that for the reference company a reduction of missing values by 10 percentage points equals an additional profit of roughly $20,000 to $30,000 per campaign.
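The per-point figures can be checked with a few lines of Python. The profit differences are taken from the list above, and the campaign size and profit per response from the reference company; the helper names are made up for this illustration.

```python
CONTACTED = 100_000          # customers contacted per campaign
PROFIT_PER_RESPONSE = 25     # US dollars

def profit_per_point(additional_profit, points_reduced):
    """Additional profit per percentage point of missing values removed."""
    return additional_profit / points_reduced

def implied_response_gain(additional_profit):
    """Gain in %Response (percentage points) behind an additional profit."""
    extra_responders = additional_profit / PROFIT_PER_RESPONSE
    return 100 * extra_responders / CONTACTED

print(profit_per_point(47_500, 20), implied_response_gain(47_500))  # 2375.0 1.9
print(profit_per_point(62_500, 20), implied_response_gain(62_500))  # 3125.0 2.5
```

Scaled to 10 percentage points, this gives $23,750 and $31,250 per campaign, which is where the rounded range in the last bullet comes from.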

Summary

  • Random missing values in training data only have limited effect
  • Missing values that occur also in the scoring data have a larger effect
  • Systematic missing values have a much larger effect

Do not only discuss the “acceptable percentage of missing values” in your data for supervised machine learning.
Also discuss why the values are missing and whether they are also missing in scoring.

Webinar presentation

Links

Links to two webinars on related analyses are shown in the text above. My data preparation for data science webinar contains more contributions around this topic.

Medium Articles:

SAS Communities Article: Using SAS Enterprise Miner for Predictive Modeling Simulation Studies

Presentation #102 in my slide collection contains more visuals on this topic.

Chapters 16 and 18 in my SAS Press book “Data Quality for Analytics Using SAS” discuss these topics in more detail.

Appendix: Tree Surrogate Imputation

With the TREE SURROGATE method, replacement values are estimated by analyzing each input as a target in a decision tree, and the remaining input and rejected variables are used as predictors. Additionally, surrogate splitting rules are created. A surrogate rule is a backup to the main splitting rule. When the main splitting rule relies on an input whose value is missing, the next surrogate is invoked. If missing values prevent the main rule and all the surrogates from applying to an observation, the main rule assigns the observation to the branch that is assigned to receive missing values.
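The fallback order of main rule, surrogate rules, and missing-value branch can be sketched for a single split as follows. This is a simplified illustration of the principle only, not the SAS Enterprise Miner implementation: the rule format (a variable and a threshold), the variable names, and the branch labels are assumptions.

```python
def route(observation, main_rule, surrogates, missing_branch):
    """Return the branch for one observation at a single tree split.

    Each rule is a (variable, threshold) pair: values below the threshold
    go "left", all other values go "right".
    """
    # Try the main rule first, then the surrogates in their given order.
    for variable, threshold in [main_rule, *surrogates]:
        value = observation.get(variable)
        if value is not None:
            return "left" if value < threshold else "right"
    # No rule could be applied: use the branch reserved for missing values.
    return missing_branch

main_rule = ("income", 3000)   # hypothetical main splitting rule
surrogates = [("age", 40)]     # hypothetical backup rule

print(route({"income": 5000, "age": 25}, main_rule, surrogates, "right"))  # right
print(route({"income": None, "age": 25}, main_rule, surrogates, "right"))  # left
print(route({"income": None, "age": None}, main_rule, surrogates, "right"))  # right
```

In the imputation itself, a tree of this kind is built for each input that needs replacement values, with that input as the target, and the value predicted for the leaf an observation ends up in serves as the replacement.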

Written by Gerhard Svolba

Applying data science and machine learning methods-Generating relevant findings to better understand business processes